ABSTRACT


We consider univariate nonparametric regression. Two standard nonparametric
regression function estimates are kernel estimates and nearest neighbor
estimates. Mack (1981) noted that both methods can be defined with respect
to a kernel or weighting function, and that for a given kernel and a suitable
choice of bandwidth, the optimal mean squared error is the same asymptotically
for kernel and nearest neighbor estimates. Yang (1981) defined a new type of
nearest neighbor regression estimate using the empirical distribution function
of the predictors to define the window over which to average. This has the effect
of forcing the number of neighbors to be the same both above and below the
value of the predictor of interest; we call these symmetrized nearest neighbor
estimates. The estimate is a kernel regression estimate with "predictors" given
by the empirical distribution function of the true predictors. We show that for
estimating the regression function at a point, the optimal mean squared error
of this estimate differs from that of kernel and ordinary nearest neighbor
estimates. No estimate dominates the others. They are asymptotically
equivalent with respect to mean squared error if one is estimating the
regression function at a mode of the predictor.


Key Words and Phrases: Nonparametric regression, kernel regression, nearest
neighbor regression, bias, mean squared error.


Section  1:  Introduction 


We consider nonparametric regression with a random univariate predictor.
Let $(X, Y)$ be a bivariate random variable with joint distribution $H$, and denote
the regression function of $Y$ on $X$ by $m(x) = E(Y \mid X = x)$. If it exists, let
$f_X$ denote the marginal density of $X$. A sample of size $n$ is taken, $(Y_i, X_i)$ for
$i = 1, \ldots, n$. Two common estimates of the regression function are the
Nadaraya-Watson kernel estimate and the nearest neighbor estimate; see Nadaraya (1964),
Watson (1964) and Stute (1984) for the former, and Mack (1981) for the latter.
Fix $x_0$ and suppose we wish to estimate $m(x_0)$. The kernel and nearest neighbor
estimates are defined as follows. Let $K$ be a nonnegative even density function.


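Kernel Estimates. Let $h_{ker}$ be a bandwidth depending on $n$. Then the
kernel estimate is

(1)  $\hat{m}_{ker}(x_0) = \dfrac{\sum_{i=1}^{n} K\{(x_0 - X_i)/h_{ker}\}\, Y_i}{\sum_{i=1}^{n} K\{(x_0 - X_i)/h_{ker}\}}$.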


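Nearest Neighbor Estimates. Let $k = k(n)$ be a sequence of positive integers,
and let $R_n$ be the Euclidean distance between $x_0$ and its $k$th nearest
neighbor. Then the nearest neighbor estimate is

(2)  $\hat{m}_{kNN}(x_0) = \dfrac{\sum_{i=1}^{n} K\{(x_0 - X_i)/R_n\}\, Y_i}{\sum_{i=1}^{n} K\{(x_0 - X_i)/R_n\}}$.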
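The two definitions can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the uniform kernel is chosen only for simplicity, and ties among the distances are assumed away. With the uniform kernel the nearest neighbor estimate reduces to the plain average of the $k$ responses closest to $x_0$.

```python
def uniform_kernel(u):
    """Uniform kernel: a nonnegative even density on [-1, 1]."""
    return 0.5 if abs(u) <= 1.0 else 0.0


def kernel_estimate(x0, xs, ys, h, kernel=uniform_kernel):
    """Kernel (Nadaraya-Watson) estimate of m(x0) with fixed bandwidth h."""
    weights = [kernel((x0 - x) / h) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations fall inside the kernel window")
    return sum(w * y for w, y in zip(weights, ys)) / total


def knn_estimate(x0, xs, ys, k, kernel=uniform_kernel):
    """Nearest neighbor estimate: same weights, but the bandwidth is R_n,
    the distance from x0 to its k-th nearest predictor value."""
    r_n = sorted(abs(x0 - x) for x in xs)[k - 1]
    if r_n == 0.0:
        raise ValueError("k-th nearest neighbor coincides with x0")
    return kernel_estimate(x0, xs, ys, r_n, kernel)
```

The only difference between the two functions is how the window width is obtained: fixed in advance for the kernel estimate, data dependent for the nearest neighbor estimate.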


Under differentiability conditions on the marginal density $f_X$, Mack has shown
that the asymptotically optimal versions of the kernel and nearest neighbor
estimates have the same behavior. Let $m^{(j)}$ and $f_X^{(j)}$ denote the $j$th
derivative of $m$ and $f_X$, respectively. If $c_K = \int K^2(x)\,dx$ and
$d_K = \int x^2 K(x)\,dx$, remembering that $K$ is symmetric, the kernel estimate
has bias

(3)  $bias_{ker} = h_{ker}^2\, d_K\, \dfrac{(m^{(2)} f_X + 2 m^{(1)} f_X^{(1)})(x_0)}{2 f_X(x_0)} + o(h_{ker}^2)$

and variance

(4)  $var_{ker} = c_K\, Var(Y \mid X = x_0) / (n h_{ker} f_X(x_0)) + o((n h_{ker})^{-1})$.
There is obviously a bias versus variance tradeoff here, so that if one wants
to achieve the minimum mean squared error, the optimal bandwidth is
$h_{ker} \sim n^{-1/5}$ and the optimal mean squared error is of order $n^{-4/5}$.
The formulae for bias and variance of the $k$th nearest neighbor estimate
coincide with (3) and (4) when $k = 2 f_X(x_0)\, n h_{ker}$.
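The rates quoted above follow from a one-line minimization. As a sketch, write $B$ for the constant multiplying $h_{ker}^2$ in the bias and $V$ for the constant multiplying $(n h_{ker})^{-1}$ in the variance; both symbols are shorthand introduced here, and $B \neq 0$ is assumed.

```latex
% Leading terms of the mean squared error at x_0, with h = h_{ker}:
\begin{align*}
\mathrm{MSE}(h) &\approx B^2 h^4 + \frac{V}{n h}, \\
\frac{d}{dh}\,\mathrm{MSE}(h) &= 4 B^2 h^3 - \frac{V}{n h^2} = 0
  \quad\Longrightarrow\quad
  h_{opt} = \left(\frac{V}{4 B^2}\right)^{1/5} n^{-1/5}, \\
\mathrm{MSE}(h_{opt}) &\approx 5 B^2 \left(\frac{V}{4 B^2}\right)^{4/5} n^{-4/5}.
\end{align*}
```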

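Let $F$ denote the distribution function of $X$, and let $F_n$ denote the empirical
distribution of the sample from $X$. Let $h_{snn}$ be a bandwidth tending to zero.
The estimate proposed by Yang (1981) and studied by Stute (1984) is

(5)  $\hat{m}_{snn}(x_0) = (n h_{snn})^{-1} \sum_{i=1}^{n} K\{(F_n(X_i) - F_n(x_0))/h_{snn}\}\, Y_i$.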

The nearest neighbor estimate defines neighbors in terms of the Euclidean norm,
which in this case is just absolute difference. The estimate (5) is also a nearest
neighbor estimate, but now neighbors are defined in terms of distance based
on the empirical distribution function. This makes for computational efficiency if
the uniform kernel is used. A direct application of (5) would result in $O(n^2 h)$
operations, but using updating as the window moves over the span of the x's
results in $O(n)$ operations. Other smooth kernels can be computed efficiently
by iterated smoothing, i.e., higher order convolution of the uniform kernel.
Another possible device is the Fast Fourier transform (Härdle, 1987). Since the
difference between (2) and (5) is that (5) picks its neighbors symmetrically,
we call it a symmetrized nearest neighbor estimate. Note that the ordinary
nearest neighbor estimate (2) always averages over a symmetric neighborhood in
the x-space, but may have an asymmetric distribution of x points in this
neighborhood. By contrast, the symmetrized estimate (5) always averages over the
same number of points left and right of $x_0$, but may in effect average over an
asymmetric neighborhood in the x-space. The estimate $\hat{m}_{snn}$ has an
intriguing relationship with the k-NN estimator used by Friedman (1986). The
variable span smoother proposed by Friedman uses the same type of neighborhood
as does $\hat{m}_{snn}$ and is used as an elementary building block for ACE, see
Breiman and Friedman (1985). The estimate (5) also looks appealingly like a
kernel regression estimate of $Y$ against not $X$ but rather $F_n(X)$.
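The updating idea for the uniform kernel can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The function name is hypothetical and the predictors are assumed to have no ties. The fit is evaluated only at the observed predictor values, and the window contains up to `k` points on each side in rank order, with `k` playing the role of $n h_{snn}$. The window sum is divided by the number of points actually in the window, which differs from the fixed normalization of the symmetrized estimate near the boundaries.

```python
def snn_uniform(xs, ys, k):
    """Symmetrized nearest neighbor smooth with a uniform kernel.

    Returns the sorted predictor values and, for each, the average response
    over a window of up to k neighbors on either side (in rank order).
    After the initial sort, the pass over the data costs O(n) because the
    window sum is updated rather than recomputed.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    x_sorted = [xs[i] for i in order]
    y_sorted = [ys[i] for i in order]
    n = len(y_sorted)

    fitted = []
    lo, hi = 0, 0          # current window is y_sorted[lo:hi]
    window_sum = 0.0
    for j in range(n):
        new_lo, new_hi = max(0, j - k), min(n, j + k + 1)
        while hi < new_hi:             # points entering on the right
            window_sum += y_sorted[hi]
            hi += 1
        while lo < new_lo:             # points leaving on the left
            window_sum -= y_sorted[lo]
            lo += 1
        fitted.append(window_sum / (hi - lo))
    return x_sorted, fitted
```

Each observation enters and leaves the window exactly once, which is what brings the cost down from a full recomputation at every point.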
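Define

(6)  $\bar{m}_{snn}(x_0) = h_{snn}^{-1} \int m(x)\, K\{(F(x) - F(x_0))/h_{snn}\}\, dF(x)$.

Then Stute shows that as $n \to \infty$, $h_{snn} \to 0$ and $n h_{snn}^3 \to \infty$,

(7)  $(n h_{snn})^{1/2} (\hat{m}_{snn}(x_0) - \bar{m}_{snn}(x_0)) \Rightarrow Normal(0, c_K\, Var(Y \mid X = x_0))$.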

This has the form (4) as long as $h_{snn} = h_{ker} f_X(x_0)$. With this choice of
$h_{snn}$, Stute's estimate has the same limit properties as a kernel or ordinary
nearest neighbor estimate as long as its bias term satisfies (3). Stute shows
that the bias is of order $h_{snn}^2$, although he does not give an asymptotic
formula. It is in fact easy to show that the bias satisfies, to order $o(h_{snn}^2)$,

(8)  $bias_{snn} = h_{snn}^2\, d_K\, \dfrac{(m^{(2)} f_X - m^{(1)} f_X^{(1)})(x_0)}{2 f_X^3(x_0)}$.

Comparison of (3) and (8) shows that even when the variances of all three
estimates are the same (the case $h_{snn} = h_{ker} f_X(x_0)$), the bias properties
differ unless

$m^{(1)}(x_0)\, f_X^{(1)}(x_0) = 0$.

Otherwise, the optimal choice of bandwidth for the kernel and ordinary nearest
neighbor estimates will lead to a different mean squared error than what obtains
for the symmetrized nearest neighbor estimate.
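A small numerical check makes the comparison concrete. The sketch below codes the leading bias terms as $h_{ker}^2 d_K (m^{(2)} f_X + 2 m^{(1)} f_X^{(1)})/(2 f_X)$ for the kernel estimate and $h_{snn}^2 d_K (m^{(2)} f_X - m^{(1)} f_X^{(1)})/(2 f_X^3)$ for the symmetrized estimate, all evaluated at $x_0$. The function names and the numerical inputs are hypothetical and serve only as illustration.

```python
def kernel_bias(h_ker, d_K, m1, m2, f0, f1):
    """Leading bias term of the kernel estimate at x0.

    m1, m2: first and second derivative of m at x0
    f0, f1: density of X and its first derivative at x0
    """
    return h_ker ** 2 * d_K * (m2 * f0 + 2.0 * m1 * f1) / (2.0 * f0)


def snn_bias(h_snn, d_K, m1, m2, f0, f1):
    """Leading bias term of the symmetrized nearest neighbor estimate at x0."""
    return h_snn ** 2 * d_K * (m2 * f0 - m1 * f1) / (2.0 * f0 ** 3)


def bias_gap(h_ker, d_K, m1, m2, f0, f1):
    """Kernel bias minus symmetrized bias when h_snn = h_ker * f0,
    i.e. when the asymptotic variances are matched."""
    return (kernel_bias(h_ker, d_K, m1, m2, f0, f1)
            - snn_bias(h_ker * f0, d_K, m1, m2, f0, f1))
```

With matched variances the gap equals $3 h_{ker}^2 d_K\, m^{(1)}(x_0) f_X^{(1)}(x_0) / (2 f_X(x_0))$, so it vanishes exactly when the product of the two first derivatives is zero, for instance at a mode of the predictor density.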

The preceding discussion presumed that we are interested in estimating
the regression function only at the point $x_0$ and that the bandwidth was chosen
locally so as to minimize asymptotic mean squared error. In practice, one is
usually interested in the regression curve over an interval, and the bandwidth is
chosen globally; see for example Härdle, Hall and Marron (1988). Inspection of
(3), (4) and (8) shows the usual tradeoff between kernel and nearest neighbor
estimates: in the tails of the distribution of $x$, the former are more variable but
less biased.

The symmetrized nearest neighbor estimate is a kernel estimate based on
transforming the $x$ data by $F_n$. Other transformations are possible, e.g., $\log(x)$.
In general, if we transform by $w = C(x)$, if $m_*(w) = m(x)$ and $w$ has density
$f_w$, then the bias and variance properties of the resulting kernel estimate are
given by (3)-(4) in $m_*$ and $f_w$, the translation to $f_X$ and $m$ being immediate by
the chain rule.
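The chain rule step can be written out as follows. This is a sketch assuming that $C$ is increasing and twice differentiable; the special case at the end treats $F$ as known, ignoring the difference between $F_n$ and $F$.

```latex
% Transformation w = C(x) with m_*(w) = m(x):
\begin{align*}
f_w(w) &= \frac{f_X(x)}{C'(x)}, \qquad
m_*^{(1)}(w) = \frac{m^{(1)}(x)}{C'(x)}, \qquad
m_*^{(2)}(w) = \frac{m^{(2)}(x)\, C'(x) - m^{(1)}(x)\, C''(x)}{C'(x)^3}.
\end{align*}
% Special case C = F: C' = f_X, C'' = f_X^{(1)}, and w is uniform on (0,1),
% so f_w = 1 and f_w^{(1)} = 0. The kernel bias formula in (m_*, f_w) gives
\begin{align*}
\text{bias} \approx \frac{h^2 d_K}{2}\, m_*^{(2)}(w_0)
  = h^2 d_K\, \frac{(m^{(2)} f_X - m^{(1)} f_X^{(1)})(x_0)}{2 f_X^3(x_0)},
  \qquad w_0 = F(x_0).
\end{align*}
```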


An  Example 


For illustrative purposes we use a large data set ($n = 7125$) of the
relationship of $Y$ = expenditure for potatoes versus $X$ = net income of British
households (in tenths of a pence) in 1973. The data come from the Family
Expenditure Survey, Annual Base Tapes 1968-1983, Department of Employment,
Statistics Division, Her Majesty's Stationery Office, London, and were made
available by the ESRC Data Archive at the University of Essex. See Härdle
(1988, Chapter 1) for a discussion. For these data, we used the quartic kernel

$K(u) = \frac{15}{16} (1 - u^2)^2\, I(|u| \le 1)$.
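As a quick sanity check on the quartic kernel, the sketch below evaluates numerically its total mass and the two constants $c_K = \int K^2(u)\,du$ and $d_K = \int u^2 K(u)\,du$ that enter the variance and bias formulas. The helper names and the midpoint rule are choices made here for illustration; the exact values are 1, 5/7 and 1/7.

```python
def quartic(u):
    """Quartic kernel K(u) = (15/16) (1 - u^2)^2 on [-1, 1], zero elsewhere."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def integrate(g, a, b, steps=20000):
    """Composite midpoint rule for the integral of g over [a, b]."""
    width = (b - a) / steps
    return width * sum(g(a + (i + 0.5) * width) for i in range(steps))


mass = integrate(quartic, -1.0, 1.0)                      # should be close to 1
c_K = integrate(lambda u: quartic(u) ** 2, -1.0, 1.0)     # close to 5/7
d_K = integrate(lambda u: u * u * quartic(u), -1.0, 1.0)  # close to 1/7
```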
We computed the ordinary kernel estimate (1) and the symmetrized nearest
neighbor estimate (5), the bandwidths being selected by crossvalidation, see
Härdle and Marron (1985). The crossvalidated bandwidths were $h_{ker} = 0.25$
on the scale (0,3) of Figure 1 and $h_{snn} = 0.15$ on the $F_n(X)$ scale. The resulting
regression curves are plotted in Figure 1. The two curves are similar for $x < 1$,
which is where most of the data lie. There is a sharp discrepancy for larger
values of $x$, the kernel estimate showing evidence of a bimodal relationship and
the symmetrized nearest neighbor estimate indicating either an asymptote or
even a slight decrease as income rises. In the context, the latter seems to make
more sense economically and looks quite similar to the curve in Hildenbrand
and Hildenbrand (1986). Statistically, it is in this range of the data that the
density takes on small values, which is exactly when we expect the biggest
differences in the estimates, i.e., the kernel estimate should be more variable
but less biased.
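For readers unfamiliar with crossvalidated bandwidths, a leave-one-out criterion for the kernel estimate can be sketched as below. This is a generic illustration and is not claimed to reproduce the exact procedure or weighting used for the figures. The function names and the bandwidth grid are hypothetical, and the double loop costs $O(n^2)$ per bandwidth, so a data set of the size used here would call for a faster implementation.

```python
import math


def quartic(u):
    """Quartic kernel on [-1, 1]."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def loo_cv_score(xs, ys, h, kernel=quartic):
    """Average squared leave-one-out prediction error of the kernel estimate.

    Returns infinity if some observation has no other observation with
    positive weight, so that such bandwidths are never selected.
    """
    n = len(xs)
    total = 0.0
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            if j != i:
                w = kernel((xs[i] - xs[j]) / h)
                num += w * ys[j]
                den += w
        if den == 0.0:
            return math.inf
        total += (ys[i] - num / den) ** 2
    return total / n


def select_bandwidth(xs, ys, grid, kernel=quartic):
    """Bandwidth in the grid with the smallest leave-one-out score."""
    return min(grid, key=lambda h: loo_cv_score(xs, ys, h, kernel))
```

The same criterion applies to the symmetrized estimate after replacing each predictor by its rank divided by $n$, since that estimate is a kernel smooth on the $F_n(X)$ scale.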